Abstract

We present a new non-parametric estimator of the conditional density of the kernel 
type. It is based on an efficient transformation of the data by quantile transform. By 
use of the copula representation, it turns out to have a remarkable product form. We 
study its asymptotic properties and compare its bias and variance to competitors 
based on nonparametric regression. A comparative numerical simulation is provided.

1 Introduction



1.1 Motivation 



Let $((X_i, Y_i);\ i = 1, \ldots, n)$ be an independent identically distributed sample
from real-valued random variables $(X, Y)$ defined on a given probability space.
For predicting the response $Y$ of the input variable $X$ at a given location $x$, it
is of great interest to estimate not only the conditional mean or regression
function $E(Y|X = x)$, but the full conditional density $f(y|x)$. Indeed, estimating
the conditional density is much more informative, since it allows one not only
to recalculate the conditional expected value $E(Y|X)$ and conditional variance
from the density, but also to provide the general shape of the conditional density.
This is especially important for multi-modal or skewed densities, which
often arise from nonlinear or non-Gaussian phenomena, where the expected
value might be nowhere near a mode, i.e. the most likely value to appear.
Moreover, for situations in which confidence intervals are preferred to point
estimates, the estimated conditional density is an object of obvious interest.



1.2 Estimation by kernel smoothing 



A natural approach to estimate the conditional density $f(y|x)$ of $Y$ given
$X = x$ would be to exploit the identity

$$f(y|x) = \frac{f_{XY}(x, y)}{f_X(x)}, \qquad (1)$$

where $f_{XY}$ and $f_X$ denote the joint density of $(X, Y)$ and $X$, respectively. By
introducing Parzen-Rosenblatt kernel estimators of these densities, namely

$$\hat f_{n,XY}(x, y) := \frac{1}{n}\sum_{i=1}^n K'_{h'}(X_i - x)\, K_h(Y_i - y),$$

$$\hat f_{n,X}(x) := \frac{1}{n}\sum_{i=1}^n K'_{h'}(X_i - x),$$

where $K_h(\cdot) = (1/h) K(\cdot/h)$ and $K'_{h'}(\cdot) = (1/h') K'(\cdot/h')$ are (rescaled) kernels
with their associated sequences of bandwidths $h = h_n$ and $h' = h'_n$ going to zero
as $n \to \infty$, one can construct the quotient

$$\hat f_n(y|x) := \frac{\hat f_{n,XY}(x, y)}{\hat f_{n,X}(x)}$$



and obtain an estimator of the conditional density. Such an estimator was first 
studied by Rosenblatt [26], and more recently by Hyndman et al. [17], who 
slightly improved on Rosenblatt's kernel based estimator. 
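
As an illustration, the quotient of kernel estimators described above can be sketched in a few lines of Python. This is a minimal sketch and not the implementation used by the authors: the Gaussian kernel, the function names and the `nan` convention for a vanishing denominator are assumptions made for the example.

```python
import math

def gaussian_kernel(t):
    """Standard normal density, used here as an illustrative kernel choice."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def double_kernel_density(x, y, xs, ys, h_x, h_y, kernel=gaussian_kernel):
    """Ratio estimate of f(y|x): joint kernel estimate over marginal one.

    The 1/n factors of numerator and denominator cancel, so they are omitted.
    Returns nan when the estimated marginal density of X vanishes at x.
    """
    weights = [kernel((xi - x) / h_x) / h_x for xi in xs]
    denominator = sum(weights)
    if denominator == 0.0:
        return float("nan")  # hypothetical convention: the ratio is undefined
    numerator = sum(w * kernel((yi - y) / h_y) / h_y
                    for w, yi in zip(weights, ys))
    return numerator / denominator

# Toy data, invented for the example.
xs = [0.0, 0.5, 1.0, 1.5]
ys = [0.1, 0.4, 1.2, 1.3]
inside = double_kernel_density(0.7, 0.8, xs, ys, h_x=0.5, h_y=0.5)
far_away = double_kernel_density(100.0, 0.8, xs, ys, h_x=0.5, h_y=0.5)
```

In this toy run, `far_away` is `nan` because every kernel weight underflows to zero far from the data, so the denominator vanishes.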



1.3 Estimation by regression techniques 



As pointed out by numerous authors, see e.g. Fan and Yao [7] chapter 6, this
approach is equivalent to the one arising from considering this conditional
density estimation problem in a regression framework. Indeed, let $F(y|x)$ be
the cumulative conditional distribution function of $Y$ given $X = x$. It stems
from the fact that

$$E\left(\mathbf{1}_{|Y - y| \le h} \mid X = x\right) = F(y + h|x) - F(y - h|x) \approx 2h\, f(y|x)$$

as $h \to 0$, that, if one replaces the expectation in the above expression by its
empirical counterpart, one can apply the usual local averaging methods and
perform a regression estimation on the synthetic data
$((1/2h)\mathbf{1}_{|Y_i - y| \le h};\ i = 1, \ldots, n)$. By a Bochner type theorem, one can even
replace the transformed data by its smoothed version $K_h(Y_i - y)$.

In particular, the popular Nadaraya-Watson regression estimator of these
smoothed data on the $X_i$,

$$\frac{\sum_{i=1}^n K'_{h'}(X_i - x)\, K_h(Y_i - y)}{\sum_{i=1}^n K'_{h'}(X_i - x)},$$

reduces itself to the same estimator of the conditional density of the double
kernel type as before.

Taking advantage of this regression formulation, Fan, Yao and Tong [8] proposed
a conditional density estimator which generalizes the kernel one by use
of local polynomial techniques. In particular, it allows one to tackle the
bias issues of kernel smoothing. However, unlike the former, it is no
longer guaranteed to be non-negative nor to integrate to 1 with respect
to $y$. With these issues in mind, Hyndman and Yao [18] built on local polynomial
techniques and suggested two improved methods, the first one based
on locally fitting a log-linear model and the second one on constrained local
polynomial modeling. An overview can be found in Fan and Yao [7] (chapters
6 and 10). Very recently, Györfi and Kohler [15] studied a partitioning type
estimate and its properties in total variation norm, and Lacour [20] studied a
projection-type estimate for Markov chains.

1.4 A product shaped estimator 

However, these two equivalent approaches suffer from several drawbacks. First,
by its form as a quotient of two estimators, the probabilistic behavior of the
Nadaraya-Watson estimator (or its local polynomial counterpart) is tricky to
study. It is usually dealt with by centering both numerator and denominator at
their expectations and linearizing the inverse, see e.g. [7], or [1] for details.
Second, at a conceptual level, one could argue that implementing regression
estimation techniques in this setting is, in a sense, unnatural: estimating a
density, even if it is a conditional one, should resort to density estimation
techniques only. Finally, practical implementations of these estimators can
lead to numerical instability when the denominator is close to zero.

To remedy these problems, we propose an estimator which builds on the idea
of using synthetic data, i.e. a representation of the data better adapted to the
problem than the original one. By transforming the data by quantile transforms
and making use of the copula function, the estimator turns out to have
a remarkable product form

$$\hat f_n(y|x) = \hat f_Y(y)\, \hat c_n(F_n(x), G_n(y)),$$

where $\hat f_Y$, $\hat c_n$, $F_n(x)$, $G_n(y)$ are estimators of the density $f_Y$ of $Y$, the copula
density $c$, the c.d.f. $F$ of $X$ and $G$ of $Y$ respectively (see the next section
for definitions). Its study then turns out to be particularly simple: it reduces to
the studies already carried out on nonparametric density estimation.

The rest of the paper is organized as follows: in section 2, we introduce the 
quantile transform and the copula representation which leads to the definition 
of our estimator. In section 3, the main asymptotic results are established and 
compared in section 4 to those of other competitors. Proofs are mainly based 
on a series of auxiliary lemmas which are given in section 5. 



2 Presentation of the estimator 

For sake of simplicity and clarity of exposition, we limit ourselves to unidi- 
mensional real valued input variables X. However, all the results of this article 
can be easily extended to the multivariate case. 

2.1 The quantile transform 

The idea of transforming the data is not new. It has been used to improve 
the range of applicability and performance of classical estimation techniques, 
e.g. to deal with skewed data, heavy tails, or restrictions on the support (see 
e.g. Devroye and Lugosi [6] chapter 14 and the references therein, and also 
Van der Vaart [35] chapter 3.2 for the related topic of variance stabilizing 
transformations in a parametric context). In order to make inference on $Y$ from
$X$, a natural question which then arises is what the "best" transformation is,
if this question makes sense. As one can note from the above references, the
"best" transformation is closely linked to the distribution of the underlying data.
We will see below that, for our problem, the natural candidate is the quantile
transform.

The quantile transform is a well-known probabilistic trick which is used to
reduce proofs, e.g. in empirical process theory, for arbitrary real valued random
variables $X$ to ones for random variables $U$ uniformly distributed on the
interval $[0, 1]$. It is based on the well-known fact that whenever $F$ is
continuous, the random variable $U = F(X)$ is uniformly distributed on $(0, 1)$
and that conversely, when $F$ is arbitrary, if $U$ is a uniformly distributed random
variable on $(0, 1)$, $X$ is equal in law to $F^{-1}(U)$, where $F^{-1} = Q$ is the
generalized inverse or quantile function of $X$ (see e.g. [28], chapter 1).

As a consequence, given a sample $(X_1, \ldots, X_n)$ of random variables with common
continuous c.d.f. $F$ defined on a probability space $(\Omega, \mathcal{A}, P)$, one can always
enlarge this probability space to carry a sequence $(U_1, \ldots, U_n)$ of uniform
$(0,1)$ random variables such that $U_i = F(X_i)$, that is to say, construct a
pseudo-sample with a prescribed uniform marginal distribution.
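
The two directions of the quantile transform can be checked numerically on a case where $F$ and $Q$ are known in closed form. The sketch below assumes an Exp(1) distribution purely for illustration (the paper does not use this example); the function names are hypothetical.

```python
import math
import random

def exp_cdf(x, rate=1.0):
    """C.d.f. F of an exponential distribution (illustrative choice of F)."""
    return 1.0 - math.exp(-rate * x)

def exp_quantile(u, rate=1.0):
    """Quantile function Q = F^{-1} of the same exponential distribution."""
    return -math.log(1.0 - u) / rate

rng = random.Random(0)
us = [rng.random() for _ in range(5000)]   # uniform pseudo-sample U_i
xs = [exp_quantile(u) for u in us]         # X_i = Q(U_i) is distributed as Exp(1)
back = [exp_cdf(x) for x in xs]            # F(X_i) recovers the uniform U_i
```

Here `back` coincides with `us` up to rounding, and the sample mean of `xs` is close to 1, the mean of an Exp(1) variable.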

2.2 The copula representation 

Formally, a copula is a bi- (or multi-)variate distribution function whose marginal
distribution functions are uniform on the interval $[0, 1]$. Indeed, Sklar [29]
proved the following fundamental result:

Theorem 2.1 For any bivariate cumulative distribution function $F_{X,Y}$ on $\mathbb{R}^2$,
with marginal cumulative distribution functions $F$ of $X$ and $G$ of $Y$, there exists
some function $C : [0, 1]^2 \to [0, 1]$, called the dependence or copula function,
such that

$$F_{X,Y}(x, y) = C(F(x), G(y)), \quad -\infty < x, y < +\infty. \qquad (2)$$

If $F$ and $G$ are continuous, this representation is unique with respect to $(F, G)$.
The copula function $C$ is itself a cumulative distribution function on $[0, 1]^2$ with
uniform marginals.

This theorem gives a representation of the bivariate c.d.f. as a function of each
univariate c.d.f. In other words, the copula function captures the dependence
structure among the components $X$ and $Y$ of the vector $(X, Y)$, irrespective
of the marginal distributions $F$ and $G$. Simply put, it allows one to deal with the
randomness of the dependence structure and the randomness of the marginals
separately.

Copulas appear to be naturally linked with the quantile transform, as formula
(2) entails that $C(u, v) = F_{X,Y}(F^{-1}(u), G^{-1}(v))$. For more details regarding copulas
and their properties, one can consult for example the book of Joe [19].
Copulas have witnessed a renewed interest in statistics, especially in finance,
since the pioneering work of Deheuvels [4], who introduced the empirical copula
process. Weak convergence of the empirical copula process was investigated
by Deheuvels [5], Van der Vaart and Wellner [36], Fermanian, Radulovic and
Wegkamp [11]. For the estimation of the copula density, refer to Gijbels and
Mielniczuk [13], Fermanian [9] and Fermanian and Scaillet [10].

From now on, we assume that the copula function $C(u, v)$ has a density $c(u, v)$
with respect to the Lebesgue measure on $[0, 1]^2$ and that $F$ and $G$ are strictly
increasing and differentiable with densities $f$ and $g$. $C(u,v)$ and $c(u,v)$ are
then the cumulative distribution function (c.d.f.) and density respectively of
the transformed variables $(U,V) = (F(X),G(Y))$. By differentiating formula
(2), we get for the joint density,

$$f_{XY}(x, y) = f(x)\, g(y)\, c(F(x), G(y)),$$

where $c(u, v) := \frac{\partial^2 C(u, v)}{\partial u\, \partial v}$ is the above mentioned copula density. Eventually,
we can obtain the following explicit formula of the conditional density

$$f_{Y|X}(x, y) = \frac{f_{XY}(x, y)}{f(x)} = g(y)\, c(F(x), G(y)). \qquad (3)$$
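
The product formula for the conditional density can be checked numerically in a case where everything is known in closed form. The sketch below assumes a standard bivariate Gaussian pair with correlation `rho`, chosen only for illustration (it is not the model studied in this paper); it compares $g(y)\,c(F(x), G(y))$, computed with the Gaussian copula density, to the textbook conditional law $N(\rho x, 1 - \rho^2)$.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal margins for both X and Y (assumption)

def gaussian_copula_density(u, v, rho):
    """Density of the Gaussian copula with correlation rho at (u, v)."""
    a, b = STD.inv_cdf(u), STD.inv_cdf(v)
    exponent = -(rho * rho * (a * a + b * b) - 2.0 * rho * a * b) \
        / (2.0 * (1.0 - rho * rho))
    return math.exp(exponent) / math.sqrt(1.0 - rho * rho)

def conditional_density_product_form(y, x, rho):
    """Product form g(y) * c(F(x), G(y)) with standard normal margins."""
    return STD.pdf(y) * gaussian_copula_density(STD.cdf(x), STD.cdf(y), rho)

def conditional_density_direct(y, x, rho):
    """Direct formula: Y given X = x is N(rho * x, 1 - rho^2)."""
    return NormalDist(rho * x, math.sqrt(1.0 - rho * rho)).pdf(y)

product_value = conditional_density_product_form(-0.3, 0.7, 0.5)
direct_value = conditional_density_direct(-0.3, 0.7, 0.5)
```

Both values agree up to rounding error, as the product form predicts.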



2.3 Construction of the estimator 



Starting from the previously stated product type formula (3), a natural plug-in
approach to build an estimator of the conditional density is to use

• a Parzen-Rosenblatt kernel type nonparametric estimator of the marginal
density $g$ of $Y$,

$$\hat g_n(y) := \frac{1}{n h_n}\sum_{i=1}^n K_0\left(\frac{y - Y_i}{h_n}\right),$$

• the empirical distribution functions $F_n(x)$ and $G_n(y)$ for $F(x)$ and $G(y)$
respectively,

$$F_n(x) := \frac{1}{n}\sum_{j=1}^n \mathbf{1}_{X_j \le x} \quad \text{and} \quad G_n(y) := \frac{1}{n}\sum_{j=1}^n \mathbf{1}_{Y_j \le y}.$$

Concerning the copula density $c(u, v)$, we noted that $c(u, v)$ is the joint density
of the transformed variables $(U,V) = (F(X),G(Y))$. Therefore, $c(u,v)$ can
be estimated by the bivariate Parzen-Rosenblatt kernel type nonparametric
density (pseudo) estimator,

$$c_n(u, v) := \frac{1}{n a_n b_n}\sum_{i=1}^n K\left(\frac{u - U_i}{a_n}, \frac{v - V_i}{b_n}\right), \qquad (4)$$

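where $K$ is a bivariate kernel and $a_n$, $b_n$ its associated bandwidths. For simplicity,
we restrict ourselves to product kernels, i.e. $K(u,v) = K_1(u)K_2(v)$ with
the same bandwidths $a_n = b_n$.

Nonetheless, since $F$ and $G$ are unknown, the random variables $(U_i, V_i)$ are not
observable, i.e. $c_n$ is not a true statistic. Therefore, we approximate the pseudo-sample
$(U_i, V_i)$, $i = 1, \ldots, n$ by its empirical counterpart $(F_n(X_i), G_n(Y_i))$,
$i = 1, \ldots, n$. We thus obtain a genuine estimator of $c(u, v)$,

$$\hat c_n(u, v) := \frac{1}{n a_n^2}\sum_{i=1}^n K_1\left(\frac{u - F_n(X_i)}{a_n}\right) K_2\left(\frac{v - G_n(Y_i)}{a_n}\right). \qquad (5)$$

Eventually, the conditional density estimator is written as

$$\hat f_n(y|x) := \left[\frac{1}{n h_n}\sum_{i=1}^n K_0\left(\frac{y - Y_i}{h_n}\right)\right] \cdot \left[\frac{1}{n a_n^2}\sum_{i=1}^n K_1\left(\frac{F_n(x) - F_n(X_i)}{a_n}\right) K_2\left(\frac{G_n(y) - G_n(Y_i)}{a_n}\right)\right]$$

or, under a more compact form,

$$\hat f_n(y|x) := \hat g_n(y)\, \hat c_n(F_n(x), G_n(y)). \qquad (6)$$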

Remark 1 To our knowledge, the estimator studied in this paper has never
been proposed in the literature. However, some connections can be made with
the nearest neighbor one proposed by Stute [32], [33] and [34] for the conditional
cumulative distribution function and the Gasser and Müller [12] and Priestley
and Chao [24] ones in the context of regression estimation. Indeed, these estimators
tackle the issue of having a random denominator by first transforming
the design to a uniform (random) one. This results in assigning
the surfaces under the kernel function instead of its heights as weights. Contrary
to our estimator, they do not make transformations of the data in both
directions $X$ and $Y$.



3 Asymptotic results 



3.1 Notations and assumptions 



We denote the $i$th moment of a generic kernel (possibly multivariate) $K$ by
$m_i(K) := \int u^i K(u)\,du$, and the $L_p$ norm of a function $h$ by $\|h\|_p := \left(\int |h|^p\right)^{1/p}$. We
use the sign $\sim$ to denote the order of the bandwidths, i.e. $h_n \sim u_n$ means that
$h_n = c_n u_n$ with $c_n \to c > 0$. The supports of the density functions $f$ and $c$ are
denoted by $\mathrm{supp}(f) = \{x \in \mathbb{R};\ f(x) > 0\}$ and $\mathrm{supp}(c) = \{(u,v) \in \mathbb{R}^2;\ c(u,v) > 0\}$,
respectively.

For stating our results, we will have to make some regularity assumptions
on the kernels and the densities which, although far from being minimal, are
somewhat customary in kernel density estimation (see subsection 5.2 for discussions
and details). Let $x$ and $y$ be two fixed points in the interior of $\mathrm{supp}(f)$ and
$\mathrm{supp}(g)$ respectively. In the remainder of this paper, we will always suppose
that

i) the c.d.f. $F$ of $X$ and $G$ of $Y$ are strictly increasing and differentiable;

ii) the densities $g$ and $c$ are twice differentiable with continuous bounded
second derivatives on their support.

Moreover, we assume that the kernels $K_0$ and $K$ satisfy the following:

(i) $K$ and $K_0$ are of bounded support and of bounded variation;

(ii) $0 \le K \le C$ and $0 \le K_0 \le C$ for some constant $C$;

(iii) $K$ and $K_0$ are first order kernels: $m_0(K) = 1$, $m_1(K) = 0$ and $m_2(K) < +\infty$,
and the same for $K_0$.

In addition, in order to approximate $\hat c_n$ by $c_n$, we will impose the slightly more
stringent assumption on the bivariate kernel $K$ that it is twice differentiable
with bounded second partial derivatives.



3.2 Weak and strong consistency of the estimator 



We have the following pointwise weak consistency theorem:

Theorem 3.1 Let the regularity conditions on the densities and kernels be
satisfied. If $h_n$ and $a_n$ tend to zero as $n \to \infty$ in such a way that $n h_n \to \infty$ and
$n a_n^2 \to \infty$, then

$$\hat f_n(y|x) = f(y|x) + O_P\left(\frac{1}{\sqrt{n h_n}} + h_n^2 + \frac{1}{\sqrt{n a_n^2}} + a_n^2\right).$$



Proof. Recall from (4) and (5) that $c_n$ and $\hat c_n$ are estimators of the copula density
$c$ based respectively on the unobservable pseudo-data $(F(X_i), G(Y_i))$ and on their
approximations $(F_n(X_i), G_n(Y_i))$. The main ingredient of the proof follows
from the decomposition:

$$\begin{aligned}
\hat f_n(y|x) - f(y|x) &= \hat g_n(y)\, \hat c_n(F_n(x), G_n(y)) - g(y)\, c(F(x), G(y)) \\
&= [\hat g_n(y) - g(y)]\, \hat c_n(F_n(x), G_n(y)) \\
&\quad + g(y)\, [\hat c_n(F_n(x), G_n(y)) - c(F(x), G(y))] \\
&=: D_1 + D_2.
\end{aligned}$$

We proceed one step further in the decomposition of each term, by first
centering at fixed locations,

$$\begin{aligned}
D_1 &= [\hat g_n(y) - g(y)]\, [\hat c_n(F_n(x), G_n(y)) - \hat c_n(F(x), G(y))] \\
&\quad + [\hat g_n(y) - g(y)]\, [\hat c_n(F(x), G(y)) - c_n(F(x), G(y))] \\
&\quad + [\hat g_n(y) - g(y)]\, [c_n(F(x), G(y)) - c(F(x), G(y))] \\
&\quad + [\hat g_n(y) - g(y)]\, c(F(x), G(y)), \qquad (7)
\end{aligned}$$

$$\begin{aligned}
D_2 &= g(y)\, [\hat c_n(F_n(x), G_n(y)) - \hat c_n(F(x), G(y))] \\
&\quad + g(y)\, [\hat c_n(F(x), G(y)) - c_n(F(x), G(y))] \\
&\quad + g(y)\, [c_n(F(x), G(y)) - c(F(x), G(y))]. \qquad (8)
\end{aligned}$$

Convergence results for the kernel density estimators of section 5.2 entail that

$$\hat g_n(y) - g(y) = O_P\left(h_n^2 + 1/\sqrt{n h_n}\right),$$
$$c_n(F(x), G(y)) - c(F(x), G(y)) = O_P\left(a_n^2 + 1/\sqrt{n a_n^2}\right),$$

by lemma 5.2 and 5.3 respectively. Approximation lemmas 5.4 and 5.5 of
sections 5.4 and 5.5 entail that

$$\hat c_n(F(x), G(y)) - c_n(F(x), G(y)) = o_P\left(a_n^2 + 1/\sqrt{n a_n^2}\right),$$
$$\hat c_n(F_n(x), G_n(y)) - \hat c_n(F(x), G(y)) = o_P\left(a_n^2 + 1/\sqrt{n a_n^2}\right).$$

We therefore obtain that

$$D_1 = O_P\left(h_n^2 + 1/\sqrt{n h_n}\right) O_P\left(a_n^2 + 1/\sqrt{n a_n^2}\right) + O_P\left(h_n^2 + 1/\sqrt{n h_n}\right),$$
$$D_2 = O_P\left(a_n^2 + 1/\sqrt{n a_n^2}\right) + O_P\left(a_n^2 + 1/\sqrt{n a_n^2}\right),$$

and the conditions $a_n \to 0$, $h_n \to 0$, $n a_n^2 \to +\infty$, $n h_n \to +\infty$ entail the
convergence of the estimator. □

Remark 2 As a corollary, we get the rate of convergence by choosing the
bandwidths which balance the bias and variance trade-off: for an optimal choice
of $h_n \sim n^{-1/5}$ and $a_n \sim n^{-1/6}$, we get

$$\hat f_n(y|x) = f(y|x) + O_P(n^{-1/3}).$$

Therefore, our estimator is rate optimal in the sense that it reaches the minimax
rate $n^{-1/3}$ of convergence, according to Stone [30].
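
The exponents quoted in this remark can be recovered by a short balancing argument. The worked computation below is a sketch which ignores constants and only uses the four terms of the rate in Theorem 3.1: each bandwidth is chosen so that its squared-bandwidth (bias) term and its stochastic term are of the same order.

```latex
\begin{align*}
h_n^2 \sim \frac{1}{\sqrt{n h_n}}
  &\iff h_n^{5/2} \sim n^{-1/2}
  \iff h_n \sim n^{-1/5},
  &\text{so that } h_n^2 &\sim n^{-2/5}, \\
a_n^2 \sim \frac{1}{\sqrt{n a_n^2}}
  &\iff a_n^{3} \sim n^{-1/2}
  \iff a_n \sim n^{-1/6},
  &\text{so that } a_n^2 &\sim n^{-1/3}.
\end{align*}
```

Since $n^{-2/5} = o(n^{-1/3})$, the terms coming from the copula density estimator dominate those coming from the marginal density estimator, which gives the overall rate $n^{-1/3}$.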

Almost sure results can be proved in the same way: we have the following
strong consistency result.

Theorem 3.2 Let the regularity conditions on the densities and kernels be
satisfied. If in addition $n h_n/(\ln\ln n) \to \infty$ and $n a_n^2/(\ln\ln n) \to \infty$, then

$$\hat f_n(y|x) = f(y|x) + O_{a.s.}\left(\sqrt{\frac{\ln\ln n}{n h_n}} + h_n^2 + \sqrt{\frac{\ln\ln n}{n a_n^2}} + a_n^2\right).$$

Proof. It follows the same lines as the preceding theorem, but uses the a.s.
results of the consistency of the kernel density estimators of lemmas 5.2 and
5.3 and of the approximation lemmas 5.4 and 5.5. It is therefore omitted. □

Remark 3 For $h_n \sim (\ln\ln n/n)^{1/5}$ and $a_n \sim (\ln\ln n/n)^{1/6}$, which is the optimal
trade-off between the bias and the stochastic term, one gets the optimal
rate $(\ln\ln n/n)^{1/3}$.

3.3 Convergence in distribution 

Theorem 3.3 Let the regularity conditions on the densities and kernels be
satisfied. Then $h_n \to 0$, $a_n \to 0$, $n h_n \to \infty$ and $n a_n^2 \to \infty$ entail

$$\sqrt{n a_n^2}\left(\hat f_n(y|x) - f(y|x)\right) \xrightarrow{d} \mathcal{N}\left(0,\ g(y) f(y|x) \|K\|_2^2\right).$$

For $h_n \sim n^{-1/5}$, $a_n \sim n^{-1/6}$ one gets the usual rate $n^{1/3}$.

Proof. With the conditions on the bandwidths, all the terms in the previous
decompositions (7) and (8) are negligible compared to $(n a_n^2)^{-1/2}$ except
$c_n(F(x),G(y)) - c(F(x),G(y))$, which is asymptotically normal by lemma 5.3
of section 5:

$$\sqrt{n a_n^2}\, g(y) \left[c_n(F(x), G(y)) - c(F(x), G(y))\right] \xrightarrow{d} \mathcal{N}\left(0,\ g^2(y)\, c(F(x), G(y)) \|K\|_2^2\right).$$

An application of Slutsky's lemma yields the desired result. □

For a vector $(y_1, \ldots, y_m)$, one can get a multidimensional version of the convergence
in distribution (fidi convergence):

Corollary 3.4 With the same assumptions, for $(y_1, \ldots, y_m)$ in the interior of
$\mathrm{supp}(g)$ such that $g(y_i) f(y_i|x) \neq 0$,

$$\left(\sqrt{n a_n^2}\ \frac{\hat f_n(y_i|x) - f(y_i|x)}{\sqrt{g(y_i) f(y_i|x)}\ \|K\|_2}\right)_{i = 1, \ldots, m} \xrightarrow{d} N^{(m)},$$

where $N^{(m)}$ is the standard $m$-variate centered normal distribution with identity
variance matrix.

Proof. It simply follows from the use of the Cramér-Wold device and is therefore
omitted. For details, see e.g. [1], theorem 2.3. □



3.4 Asymptotic Bias, Variance and Mean square error 



The asymptotic bias is calculated in the following proposition.

Proposition 3.5 With the assumptions of Theorem 3.1, we have

$$B_0 := E(\hat f_n(y|x)) - f(y|x) = g(y)\, B_K(c, x, y)\, \frac{a_n^2}{2} + o(a_n^2)$$

with

$$B_K(c, x, y) := m_2(K_1) \frac{\partial^2 c(F(x), G(y))}{\partial u^2} + m_2(K_2) \frac{\partial^2 c(F(x), G(y))}{\partial v^2}.$$

Proof. (Sketch). By taking expectation in the decompositions (7) and (8),

$$E D_1 = c(F(x), G(y))\, E[\hat g_n(y) - g(y)] + R_1,$$
$$E D_2 = g(y)\, E\left([c_n(F(x), G(y)) - c(F(x), G(y))]\right) + R_2,$$

where we have isolated the biases of $\hat g_n$ and $c_n$ and where $R_1$ and $R_2$ stand for
the remaining terms. With the assumptions on the bandwidths and derivations
made tedious by the transformation of the data by the empirical margins (see
Fermanian [9] theorem 1 for such a calculation), the terms in $R_2$ are negligible
compared to the bias of $c_n$. The bias of $c_n$, which is simply the bias of a
bivariate kernel density estimator, is of order $a_n^2$. Similarly, by bounding the
product terms in $D_1$ by the Cauchy-Schwarz inequality, routine analysis shows that
the terms in $R_1$ are negligible compared to the bias of $\hat g_n$, which is of order
$h_n^2$. Since $h_n^2$ is itself negligible compared to $a_n^2$, the main term in the decomposition is
$g(y) E(c_n(F(x),G(y)) - c(F(x),G(y)))$. Plugging in the expression of the bias
given in lemma 5.3 yields the desired result. □

The asymptotic variance has already been derived in theorem 3.3,

$$V_0 := \mathrm{Var}(\hat f_n(y|x)) = \frac{1}{n a_n^2}\, g(y) f(y|x) \|K\|_2^2 + o\left(\frac{1}{n a_n^2}\right).$$

Together with the computation of the asymptotic bias, we get the asymptotic
mean squared error as a corollary:

Corollary 3.6 With the previous assumptions, the Asymptotic Mean Squared
Error (AMSE) $E_0$ at $(x, y)$ is

$$E_0 := B_0^2 + V_0 = \frac{a_n^4\, g^2(y)\, (B_K(c, x, y))^2}{4} + \frac{g(y) f(y|x) \|K\|_2^2}{n a_n^2} + o\left(a_n^4 + \frac{1}{n a_n^2}\right),$$

which gives, for the choice of the usual bandwidths mentioned above, $E_0 = O(n^{-2/3})$.



4 Comparison with other estimators 

4.1 Presentation of alternative estimators 



For convenience, we recall below the definition of other estimators of the conditional
density encountered in the literature and summarize their bias and
variance properties. We will denote the bias of the $i$th estimator $\hat f_n^{(i)}(y|x)$ by $B_i$
and its variance by $V_i$.

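(1) Double kernel estimator: as defined in the introduction section of our
paper by the following ratio,

$$\hat f_n^{(1)}(y|x) := \frac{\sum_{i=1}^n K'_{h_1}(X_i - x)\, K_{h_2}(Y_i - y)}{\sum_{i=1}^n K'_{h_1}(X_i - x)},$$

where $h_1$ and $h_2$ are the bandwidths. One then has, see e.g. [17],
• Bias:

$$B_1 = \frac{h_1^2 m_2(K)}{2}\left(2\frac{f'(x)}{f(x)}\frac{\partial f(y|x)}{\partial x} + \frac{\partial^2 f(y|x)}{\partial x^2} + \left(\frac{h_2}{h_1}\right)^2 \frac{\partial^2 f(y|x)}{\partial y^2}\right) + o(h_1^2 + h_2^2)$$

• Variance:

$$V_1 = \frac{\|K\|_2^2\, \|K'\|_2^2\, f(y|x)}{n h_1 h_2 f(x)} + o\left(\frac{1}{n h_1 h_2}\right)$$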

(2) Local polynomial estimator: Set

$$R(\theta, x, y) := \sum_{i=1}^n \left(K_{h_2}(Y_i - y) - \sum_{j=0}^r \theta_j (X_i - x)^j\right)^2 K'_{h_1}(X_i - x),$$

then the local polynomial estimator is defined as

$$\hat f_n^{(2)}(y|x) := \hat\theta_0,$$

where $\hat\theta_{xy} := (\hat\theta_0, \hat\theta_1, \ldots, \hat\theta_r)$ is the value of $\theta$ which minimizes $R(\theta, x, y)$.
This local polynomial estimator, although it has a better bias than
the kernel one, is no longer restricted to be non-negative and does not
integrate to 1, except in the special case $r = 0$. From results of [8], we
get for the local linear estimator (see also [7] p. 256),

• Bias:

$$B_2 = \frac{h_1^2 m_2(K')}{2}\frac{\partial^2 f(y|x)}{\partial x^2} + \frac{h_2^2 m_2(K)}{2}\frac{\partial^2 f(y|x)}{\partial y^2} + o(h_1^2 + h_2^2)$$

• Variance:

$$V_2 = \frac{\|K\|_2^2\, \|K'\|_2^2\, f(y|x)}{n h_1 h_2 f(x)} + o\left(\frac{1}{n h_1 h_2}\right)$$
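(3) Local parametric estimator: As in [18] and [7], set

$$R_1(\theta, x, y) := \sum_{i=1}^n \left(K_{h_2}(Y_i - y) - A(X_i - x, \theta)\right)^2 K'_{h_1}(X_i - x),$$

where $A(X_i - x, \theta) = l\left(\sum_{j=0}^r \theta_j (X_i - x)^j\right)$ and $l(\cdot)$ is a monotonic function
mapping $\mathbb{R} \to \mathbb{R}^+$, e.g. $l(u) = \exp(u)$. Then,

$$\hat f_n^{(3)}(y|x) := A(0, \hat\theta) = l(\hat\theta_0).$$

• Bias:

$$B_3 = h_1^2\, \eta(K')\left(\frac{\partial^2 f(y|x)}{\partial x^2} - \frac{\partial^2 A(0, \theta_{xy})}{\partial x^2}\right) + \frac{h_2^2 m_2(K)}{2}\frac{\partial^2 f(y|x)}{\partial y^2} + o(h_1^2 + h_2^2)$$

• Variance:

$$V_3 = \frac{\tau(K, K')\, f(y|x)}{n h_1 h_2 f(x)} + o\left(\frac{1}{n h_1 h_2}\right)$$

where $\eta$ and $\tau$ are kernel dependent constants.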
(4) Constrained local polynomial estimator: A simple device to force
the local polynomial estimator to be positive is to set $\theta_0 = \exp(\alpha)$ in
the definition of the criterion $R$ to be minimized. The constrained local polynomial
estimator $\hat f_n^{(4)}(y|x)$ is then defined analogously to the local polynomial
estimator $\hat f_n^{(2)}(y|x)$. We have, as in [18] and [7]:

• Bias:

$$B_4 = \frac{h_1^2 m_2(K')}{2}\frac{\partial^2 f(y|x)}{\partial x^2} + \frac{h_2^2 m_2(K)}{2}\frac{\partial^2 f(y|x)}{\partial y^2} + o(h_1^2 + h_2^2)$$

• Variance:

$$V_4 = \frac{\|K\|_2^2\, \|K'\|_2^2\, f(y|x)}{n h_1 h_2 f(x)} + o\left(\frac{1}{n h_1 h_2}\right)$$

4.2 Asymptotic Bias and Variance comparison



All estimators have (hopefully) the same orders $n^{-1/3}$ and $n^{-2/3}$ in their asymptotic
bias and variance terms, for the usual choice of bandwidths. The main
difference lies in the constant terms, which depend on unknown densities.

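Bias: Contrary to all the alternative estimators, whose bias involves derivatives
of the full conditional density, one can note that our estimator's bias only
involves the density of $Y$ and the derivatives of the copula density. To make
things more explicit, the terms involved, e.g. in the local polynomial estimator,
are written as the sum of the second derivatives of the conditional density
(taking $h_1 = h_2 = h$ and leaving aside the kernel constants),

$$h^{-2} B_2 \approx \frac{\partial^2 f(y|x)}{\partial x^2} + \frac{\partial^2 f(y|x)}{\partial y^2},$$

that is to say, by differentiating $f(y|x) = g(y)\, c(F(x), G(y))$,

$$\begin{aligned}
h^{-2} B_2 \approx{} & f'(x) g(y) \frac{\partial c(F(x),G(y))}{\partial u} + f^2(x) g(y) \frac{\partial^2 c(F(x),G(y))}{\partial u^2} \\
& + g''(y)\, c(F(x),G(y)) + 3 g'(y) g(y) \frac{\partial c(F(x),G(y))}{\partial v} + g^3(y) \frac{\partial^2 c(F(x),G(y))}{\partial v^2},
\end{aligned}$$

whereas our $(g(y)/2) B_K(c,x,y)$ term, modulo the constants involved by the
kernel, is written as

$$a_n^{-2} B_0 \approx g(y)\left(\frac{\partial^2 c(F(x),G(y))}{\partial u^2} + \frac{\partial^2 c(F(x),G(y))}{\partial v^2}\right).$$

It then becomes clear that we have a simpler expression, with fewer unknown
terms, than the competitors, which do involve the density $f$ and its
derivative $f'$ of $X$ and the derivatives $g'$ and $g''$ of the $Y$ density.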

In a fixed bandwidth and asymptotic context, it seems difficult to compare 
further. Nonetheless, we believe this feature of our estimator would be practi- 
cally relevant when it comes to choosing the bandwidths. Indeed, bandwidth 
selection is usually performed by minimizing local or global asymptotic error 
criteria such as Asymptotic Mean Square Error (AMSE) or Asymptotic Mean 
Integrated Square Error (AMISE), in which unknown terms have to be estimated.
Since in our approach the asymptotic bias and variance involve fewer
unknown terms, we expect that a higher accuracy could be obtained in this
pre-estimation stage. Moreover, by having managed to separate the estimation
problem of the marginal from the copula density, we could use known optimal 
data-dependent bandwidths selection procedures for density estimation such 
as cross validation, separately for the density of Y and for the copula density. 

Remark 4 Since the copula density $c$ has a compact support $[0, 1]^2$, our estimator
may suffer from bias issues on the boundaries, i.e. in the tails of $X$ and
$Y$. To correct these issues, one could apply one of the several known techniques
to reduce the bias of the kernel estimator on the edges (see e.g. [7] chapter 5.5:
boundary kernels, reflection, transformation and local polynomial fitting). In
the tail of the distribution of $X$, this bias issue in the copula density estimator
is balanced by the improved variance, as shown below.

Variance: The variance of our estimator involves the product of the density
$g(y)$ of $Y$ and the conditional density $f(y|x)$,

$$n a_n^2 V_0 \approx g(y) f(y|x) = g^2(y)\, c(F(x), G(y)),$$

whereas competitors involve the ratio of $f(y|x)$ to the density $f(x)$ of $X$,

$$\frac{f(y|x)}{f(x)} = \frac{g(y)}{f(x)}\, c(F(x), G(y)).$$

It is a remarkable feature of the estimator we propose that its variance does
not directly involve $f(x)$, as is the case for the competitors, but only its contribution
to $Y$, through the copula density. This reflects the ability, announced in
the introduction, of the copula representation to effectively separate the
randomness pertaining to $Y$ alone from the dependence structure of $(X,Y)$.
Moreover, our estimator does not suffer from the unstable nature of competitors
which, due to their intrinsic ratio structure, get an explosive variance
for small values of the density $f(x)$, making conditional estimation difficult,
e.g. in the tail of the distribution of $X$.

Remark 5 To make estimators comparable, we have restricted ourselves to
so-called fixed bandwidth estimators, i.e. nonparametric estimators where the
bandwidths are of the generic form $h_n = b n^{\alpha}$ or $h_n = b(\ln n/n)^{\alpha}$ with $\alpha$
and $b$ real numbers. Improved behavior for all the preceding estimators can be
obtained with data-dependent bandwidths, where $h_n = h_n(X_1, \ldots, X_n, x)$ can
be functions of the location and of the data.



4.3 Finite sample numerical simulation 



4.3.1 Practical implementation of the estimator 

Although the proposed estimator seems to compare favorably asymptotically,
some pitfalls linked to the copula density estimation may show up in the
practical implementation:

Infinities at the corners: many copula densities exhibit infinite values at
their corners. Therefore, to prevent $(F_n(X_i), G_n(Y_i))$ from being equal to $(1, 1)$, we
change the empirical distribution functions $F_n$ and $G_n$ to $\frac{n}{n+1} F_n$ and
$\frac{n}{n+1} G_n$ respectively.

Boundary bias: since the copula density is of compact support $[0, 1]^2$, the
kernel method of estimation may suffer from boundary bias. To alleviate this
issue, we suggest using boundary-corrected kernels such as the beta kernels
$K_{x,b}(t) = \beta_{x/b+1,\,(1-x)/b+1}(t)$, where $\beta_{a,b}(t)$ denotes the pdf of a Beta$(a,b)$ distribution,
advocated by Chen [2], and used e.g. by [14] for estimating loss
distributions. The modified copula density pseudo estimator is thus defined as
$c_n(u, v) = \frac{1}{n}\sum_{i=1}^n K_{u,a_n}(U_i) K_{v,a_n}(V_i)$.
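
The two practical corrections just described (rescaling the empirical distribution functions by $n/(n+1)$, and beta kernels for the copula density) can be sketched as follows. This is an illustrative sketch under stated assumptions: ties in the data are ignored, the beta density is evaluated through `math.lgamma`, and all function names and the toy data are hypothetical.

```python
import math

def rescaled_pseudo_observations(sample):
    """Return rank_i / (n + 1), i.e. n/(n+1) * F_n(X_i); assumes no ties."""
    n = len(sample)
    order = sorted(range(n), key=lambda i: sample[i])
    ranks = [0] * n
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return [rank / (n + 1) for rank in ranks]

def beta_pdf(t, p, q):
    """Density of a Beta(p, q) distribution at t, for 0 < t < 1."""
    if not 0.0 < t < 1.0:
        return 0.0  # simplification: rescaled pseudo-observations never hit 0 or 1
    log_norm = math.lgamma(p + q) - math.lgamma(p) - math.lgamma(q)
    return math.exp(log_norm + (p - 1.0) * math.log(t)
                    + (q - 1.0) * math.log(1.0 - t))

def beta_kernel_copula_density(u, v, us, vs, b):
    """Average of the products of beta kernels K_{u,b}(U_i) * K_{v,b}(V_i)."""
    n = len(us)
    total = sum(beta_pdf(ui, u / b + 1.0, (1.0 - u) / b + 1.0)
                * beta_pdf(vi, v / b + 1.0, (1.0 - v) / b + 1.0)
                for ui, vi in zip(us, vs))
    return total / n

us = rescaled_pseudo_observations([2.3, -0.7, 5.1, 0.4])  # ranks 3, 1, 4, 2
vs = rescaled_pseudo_observations([1.0, 0.2, 3.5, 0.9])   # ranks 3, 1, 4, 2
corner = beta_kernel_copula_density(1.0, 1.0, us, vs, b=0.1)
```

Because the beta kernel is supported on $[0, 1]$ whatever the evaluation point, the estimate at the corner $(1, 1)$ puts no mass outside the unit square.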

Bandwidth selection: the performance of nonparametric estimators depends
crucially on the bandwidths. For conditional density estimation, bandwidth selection is a
more delicate matter than for density estimation due to the multidimensional
nature of the problem. Moreover, for ratio-type estimators, the difficulty is
increased by the local dependence of the bandwidth $h_y$ on $h_x$ implied by conditioning
near $x$. For the copula estimator, a supplemental issue comes from
the fact that the pseudo-data $(F(X_i), G(Y_i))$ are not directly accessible. Inspection
of the AMISE of the copula-based estimator suggests that we can separate the
choice of the bandwidth $h$ for $\hat g_n(y)$ from the choice of the bandwidth $a_n$ of the copula
density estimator $\hat c_n$. A rationale for a data-dependent method is to
select $h$ on the $Y_i$ data alone (e.g. by cross-validation or plug-in), separately from the $a_n$
of the copula density estimator, based on the approximate data $(F_n(X_i), G_n(Y_i))$. However,
such a bandwidth selection would require deeper analysis and we leave a
detailed study of a practical data-dependent method for bandwidth selection
of the copula-quantile estimator, together with a global and local comparison
of the estimators at their respective optimal bandwidths, for further research.



4.3.2 Model and comparison results

We simulated a sample of $n = 100$ variables $(X_i, Y_i)$ from the following model:
$X$ and $Y$ are marginally distributed as $\mathcal{N}(0, 1)$ and linked via the Frank copula

$$C(u, v, \theta) = \frac{\ln\left[(\theta + \theta^{u+v} - \theta^u - \theta^v)/(\theta - 1)\right]}{\ln \theta}$$

with parameter $\theta = 100$.
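
A sample from such a model can be generated by conditional inversion. The sketch below is an illustration under stated assumptions, not the authors' simulation code: it assumes the Frank copula is parametrized as $C(u, v, \theta) = \log_\theta\left(1 + (\theta^u - 1)(\theta^v - 1)/(\theta - 1)\right)$, inverts the conditional c.d.f. $\partial C/\partial u$ in closed form, and maps the uniforms to standard normal margins; the function names and the clipping constant are hypothetical.

```python
import math
import random
from statistics import NormalDist

def frank_conditional_cdf(u, v, theta):
    """P(V <= v | U = u) = dC/du for the Frank copula in the assumed form."""
    a, b = theta ** u, theta ** v
    return a * (b - 1.0) / ((theta - 1.0) + (a - 1.0) * (b - 1.0))

def frank_conditional_inverse(u, w, theta):
    """Solve frank_conditional_cdf(u, v, theta) = w for v."""
    a = theta ** u
    ratio = (a * (1.0 - w) + w * theta) / (a * (1.0 - w) + w)  # equals theta**v
    return math.log(ratio) / math.log(theta)

def simulate_frank_normal(n, theta, seed=0):
    """Draw n pairs with N(0, 1) margins linked by the Frank copula."""
    rng = random.Random(seed)
    std = NormalDist()
    eps = 1e-12  # keeps uniforms strictly inside (0, 1) for inv_cdf
    pairs = []
    for _ in range(n):
        u = min(max(rng.random(), eps), 1.0 - eps)
        w = min(max(rng.random(), eps), 1.0 - eps)
        v = min(max(frank_conditional_inverse(u, w, theta), eps), 1.0 - eps)
        pairs.append((std.inv_cdf(u), std.inv_cdf(v)))
    return pairs

sample = simulate_frank_normal(100, 100.0)
```

Under this assumed parametrization, values of $\theta$ above 1 correspond to negative dependence, so large values of the first coordinate tend to come with small values of the second.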

We restricted ourselves to simple rule-of-thumb methods, fixed for all $x, y$ and
based on the Normal reference rule, to get a first picture. For the selection of $a_n$
of the copula density estimator, we applied Scott's Rule on the data $(F_n(X_i), G_n(Y_i))$.
We used Epanechnikov kernels for $\hat g_n(y)$ and the other estimators. We plotted
the conditional density along with its estimates on the domain $x \in [-5, 5]$
and $y \in [-3, 3]$ in figure 1. A comparison plot at $x = 2$ is shown in figure 2.
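
For reference, a normal-reference bandwidth of the Scott type can be computed as below. This is a sketch under the assumption that the rule is applied coordinate-wise in its usual form $\hat\sigma\, n^{-1/(d+4)}$; whether the simulation used exactly this variant is not stated, and the function name and the rank-based toy data are hypothetical.

```python
import statistics

def scott_bandwidth(sample, dim):
    """Scott's normal-reference rule: sigma_hat * n ** (-1 / (dim + 4))."""
    n = len(sample)
    return statistics.stdev(sample) * n ** (-1.0 / (dim + 4))

# Rank-based pseudo-observations of a sample of size 100 (no ties): i / (n + 1).
n = 100
pseudo_obs = [i / (n + 1) for i in range(1, n + 1)]
a_n = scott_bandwidth(pseudo_obs, dim=2)  # bivariate copula density, so dim = 2
```

Since the pseudo-observations are always the same grid of ranks, this bandwidth depends only on the sample size, not on the data.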
Figure 1. 3D plots. From left to right, top to bottom: true density, quantile-copula
estimator, double kernel, local polynomial (clipped).



Pigure 2. Comparison at x=2: conditional density=thick curve, quantile-cop- 
ula=continuous line, double kernel=dotted curve, local polynomial=dashed curve. 
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4-3.3 Clipping and Estimation in the tails 



As mentioned earlier, as the performance of the estimators depends on the 
performance of the bandwidths selection method, it is delicate to give a con- 
clusive answer. However, we would like to illustrate at least one case where 
the proposed estimator clearly outperforms its competitors. Indeed, one major 
issue of alternative estimators already mentioned is their numerical explosion 
when the estimated density f{x) is close to zero. In particular, if the kernel is 
of compact support, the denominator is zero for the x whose distance from the 
closest Xi exceeds half the bandwidth times the length of the support, thereby 
allowing estimation only on a closed subset of X included in [minXj, maxXi]. 
This is one of the reason why simulation studies are often performed either 
with a marginal X density of bounded support and/or with a Gaussian ker- 
nel. Note that the problem remains with a Gaussian kernel since the estimated 
density can become quickly lower than the machine precision. To prevent from 
this numerical explosion, the definition of the conditional density estimators 
have to be modified either by 



where $c > 0$ is an arbitrary amount of clipping, and $a(\cdot)$ is an arbitrary density
estimator (usually chosen to be zero or $g(y)$).
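
The clipping idea can be sketched in a few lines. This is a minimal illustration under an assumed form of the modification, which may differ from the variant used in the comparison: the ratio estimate is returned when the estimated marginal density exceeds $c$, and the fallback $a(y)$ is returned otherwise. The function names, the common bandwidth `h` and the Epanechnikov kernel are illustrative choices of ours.

```python
def epanechnikov(t):
    """Epanechnikov kernel, supported on [-1, 1]."""
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0


def kde_x(x, xs, h):
    """Parzen-Rosenblatt estimate of the marginal density of X at x."""
    return sum(epanechnikov((x - xi) / h) for xi in xs) / (len(xs) * h)


def kde_xy(x, y, xs, ys, h):
    """Product-kernel estimate of the joint density of (X, Y) at (x, y)."""
    total = sum(
        epanechnikov((x - xi) / h) * epanechnikov((y - yi) / h)
        for xi, yi in zip(xs, ys)
    )
    return total / (len(xs) * h * h)


def clipped_ratio(x, y, xs, ys, h, c=1e-5, a=lambda y: 0.0):
    """Ratio estimator with clipping (assumed form, see the lead-in):
    fall back to a(y) when the estimated marginal density is at most c."""
    fx = kde_x(x, xs, h)
    if fx <= c:
        return a(y)
    return kde_xy(x, y, xs, ys, h) / fx
```

With a compactly supported kernel, any location $x$ farther than `h` from every $X_i$ gives a zero denominator, so the returned value there is entirely determined by the arbitrary choice of `a`.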

An illustration of these issues clearly appears in figure 1. The unclipped version
of the double kernel estimator is unable to estimate the conditional density for
$|x|$ roughly $> 3$, and the clipped version of the local polynomial estimator with
$c = 0.00001$ and $a(y) = g(y)$ gives a wrong estimate in the tails, reflecting
the arbitrary choices in the clipping decision. On the contrary, the quantile-copula
estimator is surprisingly able to estimate the conditional density $f(y|x)$
at locations $x$ where there is "no data", i.e. in the tails of the distribution of
$X$. An explanation of this apparently paradoxical phenomenon comes from the
fact that the estimator is partially based on the ranks of $X_i$ and $Y_i$. Therefore,
it can recover "hidden" information on the density of $X$ from the ordering of
the pairs $(X_i, Y_i)$. See Hoff [16] for a detailed explanation. We believe that
this feature might be of potential interest for applications, e.g. in statistical
inference of extreme values and rare events.
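
To make the rank-based mechanism concrete, here is a minimal sketch, assuming that the product form (3) reads $\hat f_n(y|x) = \hat g_n(y)\,\hat c_n(F_n(x), G_n(y))$, with $\hat c_n$ the product-kernel estimator built on the ranks $(F_n(X_i), G_n(Y_i))$. The function names, the Epanechnikov kernel for both factors, the simulated sample and the bandwidth values `h` and `a` are illustrative assumptions, not the implementation used for the figures.

```python
import random
from bisect import bisect_right


def epanechnikov(t):
    """Epanechnikov kernel, supported on [-1, 1]."""
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0


def ecdf(sample):
    """Empirical c.d.f. of the sample, returned as a function."""
    s = sorted(sample)
    n = len(s)
    return lambda t: bisect_right(s, t) / n


def quantile_copula_density(x, y, xs, ys, h, a):
    """Product-form estimate g_hat(y) * c_hat(F_n(x), G_n(y)) (assumed form)."""
    n = len(xs)
    F_n, G_n = ecdf(xs), ecdf(ys)
    # Kernel estimate of the marginal density of Y at y.
    g_hat = sum(epanechnikov((y - yi) / h) for yi in ys) / (n * h)
    # Kernel copula density estimate, computed on the ranks of the data.
    u, v = F_n(x), G_n(y)
    c_hat = sum(
        epanechnikov((u - F_n(xi)) / a) * epanechnikov((v - G_n(yi)) / a)
        for xi, yi in zip(xs, ys)
    ) / (n * a * a)
    return g_hat * c_hat


# Illustrative sample: X standard Gaussian, Y linearly related to X plus noise.
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(200)]
ys = [0.8 * xi + 0.6 * rng.gauss(0.0, 1.0) for xi in xs]

# Evaluate one unit beyond the largest observed X_i, at the response paired with it.
i_max = max(range(len(xs)), key=lambda k: xs[k])
tail_value = quantile_copula_density(xs[i_max] + 1.0, ys[i_max], xs, ys, h=0.5, a=0.3)
```

At such a location $F_n(x) = 1$, so the estimate is driven by the observations with the highest ranks in $X$ and no division by an estimated marginal density of $X$ is involved; the returned value is finite and strictly positive.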



Discussion 

The quantile transform and the use of the copula formula have thus turned the
conditional density formula (1) of the ratio type into the product one (3). This
formula was the backbone of our article, where this product form appeared
to be especially appealing for statistical estimation: consistency and limit
results were obtained by simple combination of the previously known ones on
(unconditional) density estimation. The resulting estimator shows interesting
asymptotic bias and variance properties compared to competitors. Although
its finite sample implementation does not yet give a clear and conclusive
picture, it already yields some promising results, e.g. for estimation in the tails
of $X$, where the proposed estimator does not suffer from clipping issues.



5 Appendix: auxiliary results

In this section, we gather some preliminary results which we will need as basic
tools for the proofs of section 3. In subsection 5.1, we recall classical
results about the convergence of the Kolmogorov-Smirnov statistic. Next, we
give a brief overview of kernel density estimation and apply these results to
the estimators $\hat g_n$ (section 5.2) and $c_n$ (section 5.3). Finally, we need two
approximation lemmas of $\hat c_n$ by $c_n$ in sections 5.4 and 5.5.

5.1 Approximation of the pseudo-variables $F(X_i)$ by their estimates $F_n(X_i)$

For $(X_i, i = 1, \dots, n)$ an i.i.d. sample of a real random variable $X$ with common
c.d.f. $F$, the Kolmogorov-Smirnov statistic is defined as $D_n := \|F_n - F\|_\infty$.
Glivenko-Cantelli, Kolmogorov and Smirnov, Chung, Donsker among others
have studied its convergence properties in increasing generality (see [28] and
[36] for recent accounts). For our purpose, we only need to formulate these
results in the following rough form:

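Lemma 5.1 For an i.i.d. sample from a continuous c.d.f. $F$,

$$\|F_n - F\|_\infty = O_P\left(\frac{1}{\sqrt{n}}\right), \qquad \|F_n - F\|_\infty = O_{a.s.}\left(\sqrt{\frac{\ln\ln n}{n}}\right).$$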

Since $F$ is unknown, the random variables $U_i = F(X_i)$ are not observed. As a
consequence of the preceding lemma 5.1, one can naturally approximate these
variables by the statistics $F_n(X_i)$. Indeed,

$$|F(X_i) - F_n(X_i)| \le \sup_{x} |F(x) - F_n(x)| = \|F_n - F\|_\infty \quad a.s. \tag{10}$$

Thus, $|F(X_i) - F_n(X_i)|$ is no more than an $O_P(n^{-1/2})$ or an $O_{a.s.}((\ln\ln n/n)^{1/2})$.
These rates of approximation appear to be faster than those of statistical
estimation of densities, as is shown in the next subsection.
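
Inequality (10) can be checked numerically. The sketch below assumes a sample from the uniform distribution on $[0, 1]$, so that $F(x) = x$ is known and $D_n$ can be computed exactly from the order statistics; the function names are hypothetical.

```python
import random


def ks_statistic_uniform(us):
    """D_n = sup_x |F_n(x) - F(x)| for a sample from F(x) = x on [0, 1].

    The supremum is attained at an order statistic, either at the jump of
    F_n or just before it."""
    s = sorted(us)
    n = len(s)
    return max(max((i + 1) / n - u, u - i / n) for i, u in enumerate(s))


def max_pseudo_error_uniform(us):
    """max_i |F(X_i) - F_n(X_i)| with F(x) = x and F_n the empirical c.d.f."""
    s = sorted(us)
    n = len(s)
    # F_n evaluated at the (i+1)-th order statistic equals (i + 1) / n (no ties a.s.).
    return max(abs(u - (i + 1) / n) for i, u in enumerate(s))


rng = random.Random(1)
for n in (100, 1000, 10000):
    us = [rng.random() for _ in range(n)]
    d_n = ks_statistic_uniform(us)
    err = max_pseudo_error_uniform(us)
    # err never exceeds d_n, as stated in (10); sqrt(n) * d_n is printed for reference.
    print(n, err <= d_n, round(n ** 0.5 * d_n, 3))
```

For samples of increasing size, $\sqrt{n}\, D_n$ is expected to remain stochastically bounded, in line with lemma 5.1.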

5.2 Convergence of the kernel density estimator $\hat g_n$

We recall below some classical results about the convergence of the
Parzen-Rosenblatt kernel non-parametric estimator of a $d$-variate density. Since its
inception by Rosenblatt [25] and Parzen [22], it has been studied by a great
many authors. See e.g. Scott [27], Prakasa Rao [23], Nadaraya [21] for details.
See also Bosq [1], chapter 2.

It is well known that the bias of the kernel density estimator depends on the
degree of smoothness of the underlying density, measured by its number of
derivatives or its Lipschitz order. For the bias to converge
to zero, it suffices to assume that the density is continuous (see [22]). To get
further information on the rate of convergence of the estimator, it is necessary
to make further assumptions. Moreover, for kernel functions with unbounded
support, the rate of convergence also depends on the tail behavior of the
kernel (see Stute [31]). Therefore, for clarity of exposition and simplicity of
notation, we will make the customary assumptions that the density is twice
differentiable and that the kernel is of bounded support. We then have the
following results:

• Bias: With the previous assumptions, for an $x$ in the interior of $\mathrm{supp}(f)$,
$h_n \to 0$ and $n h_n^d \to \infty$ entail that



With the multivariate kernel $K$ as a product of $d$ order one kernels $K_i$, the
above sum reduces to the diagonal terms:

$$E f_n(x) = f(x) + \frac{h_n^2}{2} \sum_{1 \le i \le d} \frac{\partial^2 f(x)}{\partial x_i^2} + o(h_n^2).$$



• Variance: with the same assumptions,

$$\mathrm{Var}\, f_n(x) = \frac{f(x)\,\|K\|_2^2}{n h_n^d} + o\left(\frac{1}{n h_n^d}\right).$$



• Pointwise asymptotic normality: under the previous conditions,

$$\sqrt{n h_n^d}\,\bigl(f_n(x) - E f_n(x)\bigr) \stackrel{d}{\longrightarrow} \mathcal{N}\bigl(0, f(x)\,\|K\|_2^2\bigr).$$

For a choice of the bandwidth as $h_n \simeq n^{-1/(d+4)}$, which realizes the optimal
trade-off between the bias and variance, one gets the rate $n^{-2/(d+4)}$, which
is the optimal speed of convergence in the minimax sense in the class of
density functions with bounded second derivatives, according to [30].

• Pointwise almost sure convergence: if moreover $n h_n^d/(\ln\ln n) \to \infty$ (see [3]),
we have that

$$f_n(x) - E f_n(x) = O_{a.s.}\left(\sqrt{\frac{\ln\ln n}{n h_n^d}}\right).$$

For a choice of the bandwidth as $h_n \simeq ((\ln\ln n)/n)^{1/(d+4)}$, we get the rate
of convergence $((\ln\ln n)/n)^{2/(d+4)}$:

$$f_n(x) - f(x) = O_{a.s.}\left(\left(\frac{\ln\ln n}{n}\right)^{2/(d+4)}\right).$$
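
The bandwidth choice $h_n \simeq n^{-1/(d+4)}$ follows from balancing a squared bias of order $h^4$ against a variance of order $1/(n h^d)$. The following worked computation is a heuristic sketch in which $C_1$ and $C_2$ are unspecified positive constants introduced here only for illustration:

```latex
\begin{align*}
\mathrm{MSE}(h) &\approx C_1 h^{4} + \frac{C_2}{n h^{d}}, \\
\frac{d}{dh}\,\mathrm{MSE}(h) = 0
  &\iff 4 C_1 h^{3} = \frac{d\, C_2}{n h^{d+1}}
  \iff h^{d+4} = \frac{d\, C_2}{4 C_1\, n}, \\
h_n \simeq n^{-1/(d+4)}
  &\implies \mathrm{MSE}(h_n) \simeq n^{-4/(d+4)},
  \qquad |f_n(x) - f(x)| = O_P\bigl(n^{-2/(d+4)}\bigr).
\end{align*}
```

For $d = 1$ this gives the bandwidth $n^{-1/5}$ and the rate $n^{-2/5}$; for $d = 2$ it gives the bandwidth $n^{-1/6}$ and the rate $n^{-1/3}$.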



Applied to our case ($d = 1$), we can summarize these results for further
reference in the following lemma for the estimator $\hat g_n$ of the density $g$ of $Y$:

Lemma 5.2 With the previous assumptions, for a point $y$ in the interior of
the support of $g$, and a bandwidth chosen such that $h_n \simeq n^{-1/5}$, we have

$$|\hat g_n(y) - g(y)| = O_P(n^{-2/5}),$$
$$n^{2/5}\,[\hat g_n(y) - g(y)] \stackrel{d}{\longrightarrow} \mathcal{N}\bigl(0, g(y)\,\|K_0\|_2^2\bigr).$$

With the same assumptions, but for a bandwidth choice of $h_n \simeq (\ln\ln n/n)^{1/5}$,

$$\hat g_n(y) - g(y) = O_{a.s.}\left(\left(\frac{\ln\ln n}{n}\right)^{2/5}\right). \tag{11}$$



5.3 Convergence of $c_n(u,v)$



As mentioned before, the assumptions that $F$ and $G$ be differentiable and
strictly increasing entail that $c$ is the density of the transformed variables
$(U,V) := (F(X),G(Y))$. Therefore, once one convinces oneself that $c_n(u,v)$
is simply the kernel density estimator of the bivariate density $c(u,v)$ of the
pseudo-variables $(U,V)$, one directly draws its convergence properties by
applying the results of the preceding subsection with $d = 2$:

Lemma 5.3 For a choice of $a_n \simeq n^{-1/6}$, for every $(u,v) \in (0,1)^2$, results
similar to those of lemma 5.2 hold for $c_n$, with rates of convergence of $n^{-1/3}$
and $(\ln\ln n/n)^{1/3}$ respectively.
5.4 An approximation lemma of $\hat c_n(u,v)$ by $c_n(u,v)$



The lemma of this section gives the rate of approximation of the kernel copula
density estimator $\hat c_n(u,v)$ computed on the real data $(F_n(X_i), G_n(Y_i))$ by its
analogue $c_n(u,v)$ computed on the pseudo-data $(U_i,V_i) := (F(X_i),G(Y_i))$. A
similar result, but with a different proof, has been obtained in Fermanian [9],
theorem 1.

Lemma 5.4 Let $(u,v) \in (0,1)^2$. If the kernel $K(u,v) = K_1(u)K_2(v)$ is twice
differentiable with bounded second derivatives, then

$$|\hat c_n(u,v) - c_n(u,v)| = O_P\left(a_n^2 + \frac{1}{\sqrt{n a_n^2}}\right),$$
$$|\hat c_n(u,v) - c_n(u,v)| = O_{a.s.}\left(\sqrt{\frac{\ln\ln n}{n a_n^2}}\right).$$



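Proof. We denote by $\|\cdot\|$ a norm for vectors. Set

$$\Delta_n := \hat c_n(u,v) - c_n(u,v) = \frac{1}{n a_n^2} \sum_{i=1}^n \Delta_{i,n}(u,v)$$

with

$$\Delta_{i,n}(u,v) := K\left(\frac{u - F_n(X_i)}{a_n}, \frac{v - G_n(Y_i)}{a_n}\right) - K\left(\frac{u - F(X_i)}{a_n}, \frac{v - G(Y_i)}{a_n}\right),$$

and define

$$Z_{i,n} := \begin{pmatrix} F(X_i) - F_n(X_i) \\ G(Y_i) - G_n(Y_i) \end{pmatrix}.$$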

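As mentioned in section 5.1, $|F_n(X_i) - F(X_i)| \le \|F_n - F\|_\infty$ and
$|G_n(Y_i) - G(Y_i)| \le \|G_n - G\|_\infty$ a.s. for every $i = 1, \dots, n$. Lemma 5.1 thus entails that
the norm of $Z_{i,n}$ is bounded independently of $i$ and such that

$$\|Z_{i,n}\| = O_P(1/\sqrt{n}), \quad i = 1, \dots, n, \tag{12}$$
$$\|Z_{i,n}\| = O_{a.s.}\left(\sqrt{\ln\ln n / n}\right), \quad i = 1, \dots, n. \tag{13}$$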


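Now, for every fixed $(u,v) \in [0,1]^2$, since the kernel $K$ is twice differentiable,
there exist, by Taylor expansion, random variables $\tilde U_{i,n}$ and $\tilde V_{i,n}$ such that,
almost surely,

$$\Delta_n = \frac{1}{n a_n^3} \sum_{i=1}^n Z_{i,n}^T \nabla K\left(\frac{u - F(X_i)}{a_n}, \frac{v - G(Y_i)}{a_n}\right) + \frac{1}{2 n a_n^4} \sum_{i=1}^n Z_{i,n}^T \nabla^2 K\left(\frac{u - \tilde U_{i,n}}{a_n}, \frac{v - \tilde V_{i,n}}{a_n}\right) Z_{i,n} := \Delta_1 + \Delta_2,$$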
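where $Z_{i,n}^T$ denotes the transpose of the vector $Z_{i,n}$ and $\nabla K$ and $\nabla^2 K$ the
gradient and the Hessian respectively of the multivariate kernel function $K$:

$$\nabla K = \begin{pmatrix} \dfrac{\partial K}{\partial u} \\[2ex] \dfrac{\partial K}{\partial v} \end{pmatrix}, \qquad \nabla^2 K = \begin{pmatrix} \dfrac{\partial^2 K}{\partial u^2} & \dfrac{\partial^2 K}{\partial u \partial v} \\[2ex] \dfrac{\partial^2 K}{\partial u \partial v} & \dfrac{\partial^2 K}{\partial v^2} \end{pmatrix}.$$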



Negligibility of $\Delta_2$: By the boundedness assumption on the second-order
derivatives of the kernel, and equations 12 and 13,



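Negligibility of $\Delta_1$: By centering at expectations,

$$\Delta_1 = \frac{1}{n a_n^3} \sum_{i=1}^n Z_{i,n}^T \left(\nabla K\left(\frac{u - F(X_i)}{a_n}, \dots\right) - E\nabla K\left(\frac{u - F(X_i)}{a_n}, \dots\right)\right) + \frac{1}{n a_n^3} \sum_{i=1}^n Z_{i,n}^T\, E\nabla K\left(\frac{u - F(X_i)}{a_n}, \dots\right) := \Delta_{11} + \Delta_{12}.$$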



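Negligibility of $\Delta_{12}$: Bias results on the bivariate gradient kernel estimator (see
Scott [27], chapter 6) entail that

$$\frac{1}{a_n^3}\, E\nabla K\left(\frac{u - F(X_i)}{a_n}, \frac{v - G(Y_i)}{a_n}\right) = \nabla c(u,v) + O(a_n^2).$$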



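Cauchy-Schwarz inequality yields that

$$|\Delta_{12}| \le \frac{n\,\|Z_{i,n}\|}{n a_n^3} \left\| E\nabla K\left(\frac{u - F(X_i)}{a_n}, \dots\right) \right\|.$$

In turn, with equations 12 and 13,

$$\Delta_{12} = O_P(1/\sqrt{n}) \quad \text{and} \quad \Delta_{12} = O_{a.s.}\left(\sqrt{\ln\ln n / n}\right).$$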
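Negligibility of $\Delta_{11}$: Set $A_i = \nabla K\left(\frac{u - F(X_i)}{a_n}, \dots\right) - E\nabla K\left(\frac{u - F(X_i)}{a_n}, \dots\right)$. Then,

$$|\Delta_{11}| \le \frac{1}{n a_n^3} \sum_{i=1}^n \|Z_{i,n}\|\, \|A_i\|.$$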



The boundedness assumption on the derivative of the kernel implies that $\|A_i\| \le 2C$
a.s. We apply the Hoeffding inequality for independent, centered random variables $(\eta_i)$,
bounded by $M$ but not necessarily identically distributed (e.g. see [1]),

$$P\left(\sum_{i=1}^n \eta_i > t\right) \le \exp\left(-\frac{t^2}{4 n M^2}\right). \tag{14}$$
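Here, for every $\epsilon > 0$, with $M = 2C$, $\eta_i = \|A_i\| - E\|A_i\|$, $t = \epsilon\, n^{1/2} (\ln\ln n)^{1/2}$,
we get that

$$P\left(\sum_{i=1}^n \bigl(\|A_i\| - E\|A_i\|\bigr) > \epsilon \sqrt{n \ln\ln n}\right) \le \exp\left(-\frac{\epsilon^2 \ln\ln n}{4M^2}\right) = \frac{1}{(\ln n)^{\delta}},$$

with a $\delta > 0$ and where the r.h.s. goes to zero as $n \to \infty$. Therefore,
$\sum_{i=1}^n (\|A_i\| - E\|A_i\|) = O_P(\sqrt{n \ln\ln n})$.

For the almost sure negligibility, we get similarly by inequality 14 that, for
every $\epsilon > 0$, with $t = \epsilon\, n^{(1+\delta)/2}$ and $\delta > 0$,

$$P\left(\sum_{i=1}^n \bigl(\|A_i\| - E\|A_i\|\bigr) > \epsilon\, n^{(1+\delta)/2}\right) \le \exp\left(-\frac{\epsilon^2 n^{\delta}}{4M^2}\right)$$

and the series on the r.h.s. is convergent. In turn, the Borel-Cantelli lemma
implies that $\sum_{i=1}^n (\|A_i\| - E\|A_i\|) = O_{a.s.}(n^{(1+\delta)/2})$.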

It remains to evaluate $\sum_{i=1}^n E\|A_i\|$. First, we have that
$E\|A_i\| \le 2E\|\nabla K((u - F(X_i))/a_n, \dots)\|$. Second, since $K$ is differentiable and of product form
$K(u,v) = K_1(u)K_2(v)$, each sub-kernel is of bounded variation and can be written
as a difference of two monotone increasing functions. For example, set
$K_1 = K_1^{+} - K_1^{-}$ and define $K^* := (K_1^{+} + K_1^{-})K_2$. We have

$$\left|\frac{\partial K}{\partial u}\right| \le \bigl(|(K_1^{+})'| + |(K_1^{-})'|\bigr) K_2 = \bigl((K_1^{+})' + (K_1^{-})'\bigr) K_2 = \frac{\partial K^*}{\partial u},$$

where the equality proceeds from the positivity of the derivatives. As a
consequence,

$$E\left|\frac{\partial K}{\partial u}\left(\frac{u - F(X_i)}{a_n}, \dots\right)\right| \le E\left[\frac{\partial K^*}{\partial u}\left(\frac{u - F(X_i)}{a_n}, \dots\right)\right]$$

and similarly for the other partial derivative. The r.h.s. of the previous
inequality is, after an integration by parts, of order $a_n^3$ by the results on the kernel
estimator of the gradient of the density (see Scott [27], chapter 6). Therefore,
$\sum_{i=1}^n E\|A_i\| = O(n a_n^3)$.



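Collecting all elements, we eventually obtain that

$$\Delta_{11} = O_P\left(\frac{\sqrt{n \ln\ln n} + n a_n^3}{n a_n^3} \cdot \frac{1}{\sqrt{n}}\right), \qquad \Delta_{11} = O_{a.s.}\left(\frac{n^{(1+\delta)/2} + n a_n^3}{n a_n^3} \sqrt{\frac{\ln\ln n}{n}}\right),$$

which, together with the bounds on $\Delta_{12}$ and $\Delta_2$, yields the rates stated in the lemma
for $\delta$ small enough ($< 1/3$ for $a_n \simeq n^{-1/6}$). □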
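
The approximation studied in lemma 5.4 can be illustrated numerically when the margins are known. The sketch below rests on assumptions of ours: $X$ is uniform on $[0,1]$ and $Y = (X + 0.3W) \bmod 1$ with $W$ uniform and independent of $X$, so that $F(x) = x$ and $G(y) = y$; the kernel is a triweight product kernel (twice continuously differentiable); and all function names are hypothetical. It compares the estimator computed on the ranks with its analogue computed on the true pseudo-data.

```python
import random
from bisect import bisect_right


def triweight(t):
    """Triweight kernel, twice continuously differentiable, supported on [-1, 1]."""
    return (35.0 / 32.0) * (1.0 - t * t) ** 3 if abs(t) <= 1.0 else 0.0


def copula_kde(u, v, us, vs, a):
    """Product-kernel density estimate at (u, v) from the points (us[i], vs[i])."""
    n = len(us)
    total = sum(triweight((u - ui) / a) * triweight((v - vi) / a)
                for ui, vi in zip(us, vs))
    return total / (n * a * a)


def empirical_pseudo(sample):
    """Empirical c.d.f. of the sample evaluated at each sample point."""
    s = sorted(sample)
    n = len(s)
    return [bisect_right(s, t) / n for t in sample]


def simulate(n, seed=0):
    """X uniform on [0, 1]; Y = (X + 0.3 W) mod 1, so that Y is uniform as well."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [(x + 0.3 * rng.random()) % 1.0 for x in xs]
    return xs, ys


n = 500
xs, ys = simulate(n)
a = n ** (-1.0 / 6.0)  # bandwidth of order n^(-1/6), as in lemma 5.3
# Estimator on the true pseudo-data: here F(X_i) = X_i and G(Y_i) = Y_i.
c_pseudo = copula_kde(0.5, 0.5, xs, ys, a)
# Estimator on the ranks F_n(X_i), G_n(Y_i).
c_ranks = copula_kde(0.5, 0.5, empirical_pseudo(xs), empirical_pseudo(ys), a)
gap = abs(c_ranks - c_pseudo)
```

Because the kernel is Lipschitz, `gap` is always bounded by a constant times the largest rank error divided by $a_n^3$; the lemma gives the sharper stochastic rate.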

5.5 An approximation lemma for $\hat c_n(F_n(x),G_n(y))$ by $\hat c_n(F(x),G(y))$

The lemma of this subsection gives the rate of deviation of the kernel copula
density estimator $\hat c_n$ from a varying location $(F_n(x), G_n(y))$ to a fixed location
$(F(x), G(y))$.

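Lemma 5.5 With the same assumptions as in the preceding lemma, we have

$$\hat c_n(F_n(x), G_n(y)) - \hat c_n(F(x), G(y)) = O_P\left(a_n^2 + \frac{1}{\sqrt{n a_n^2}}\right),$$
$$\hat c_n(F_n(x), G_n(y)) - \hat c_n(F(x), G(y)) = O_{a.s.}\left(\sqrt{\frac{\ln\ln n}{n a_n^2}}\right).$$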

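Proof. We proceed similarly as in the preceding lemma. Set

$$\Delta_n(x,y) := \hat c_n(F_n(x), G_n(y)) - \hat c_n(F(x), G(y)) = \frac{1}{n a_n^2} \sum_{i=1}^n \Delta_{i,n}(x,y) \tag{15}$$

with

$$\Delta_{i,n}(x,y) := K\left(\frac{F_n(x) - F_n(X_i)}{a_n}, \frac{G_n(y) - G_n(Y_i)}{a_n}\right) - K\left(\frac{F(x) - F_n(X_i)}{a_n}, \frac{G(y) - G_n(Y_i)}{a_n}\right),$$

and define

$$Z_n(x,y) := \begin{pmatrix} F_n(x) - F(x) \\ G_n(y) - G(y) \end{pmatrix}.$$

We first express $\Delta_{i,n}(x,y)$ at a fixed location $(F(x), G(y))$ by a Taylor expansion
and by bounding uniformly the second order terms,

$$\Delta_{i,n}(x,y) = \frac{Z_n^T(x,y)}{a_n} \nabla K\left(\frac{F(x) - F_n(X_i)}{a_n}, \frac{G(y) - G_n(Y_i)}{a_n}\right) + \frac{\|Z_n(x,y)\|^2}{a_n^2} R_1, \tag{16}$$

where $R_1$ is uniformly bounded almost surely: $R_1 = O_{a.s.}(1)$. We then go from
the data $(F_n(X_i), G_n(Y_i))$ to the pseudo-data $(F(X_i), G(Y_i))$, which are fixed with respect to $n$.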
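By a second Taylor expansion,

$$\frac{1}{a_n} \nabla K\left(\frac{F(x) - F_n(X_i)}{a_n}, \frac{G(y) - G_n(Y_i)}{a_n}\right) = \frac{1}{a_n} \nabla K\left(\frac{F(x) - F(X_i)}{a_n}, \frac{G(y) - G(Y_i)}{a_n}\right) + \frac{1}{a_n^2} \nabla^2 K\left(\frac{F(x) - F(X_i)}{a_n}, \frac{G(y) - G(Y_i)}{a_n}\right) Z_{i,n} + \frac{\|Z_{i,n}\|^2}{a_n^3} R_2, \tag{17}$$

where $R_2 = O_{a.s.}(1)$ uniformly in $i$, $x$ and $y$. Therefore, plugging 16 and 17 in
15, we get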

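$$\Delta_n(x,y) = \frac{Z_n^T(x,y)}{n a_n^3} \sum_{i=1}^n \nabla K\left(\frac{F(x) - F(X_i)}{a_n}, \frac{G(y) - G(Y_i)}{a_n}\right) + \frac{Z_n^T(x,y)}{n a_n^4} \sum_{i=1}^n \nabla^2 K\left(\frac{F(x) - F(X_i)}{a_n}, \frac{G(y) - G(Y_i)}{a_n}\right) Z_{i,n} + \frac{\|F_n - F\|_\infty^2 + \|G_n - G\|_\infty^2}{a_n^4} R_3,$$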



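with the remainder term $R_3 = O_{a.s.}(1)$ uniformly. As before, the properties of
the kernel (derivative) density estimator (see Scott [27], chapter 6) entail that

$$\frac{1}{n a_n^3} \sum_{i=1}^n \nabla K\left(\frac{F(x) - F(X_i)}{a_n}, \frac{G(y) - G(Y_i)}{a_n}\right) = \nabla c(F(x), G(y)) + O_P\left(a_n^2 + \frac{1}{\sqrt{n a_n^4}}\right).$$

Therefore, using 12 and bounding uniformly the Hessian, 15 becomes

$$\Delta_n(x,y) = O_P\left[\|Z_n\| \left(1 + a_n^2 + \frac{1}{\sqrt{n a_n^4}}\right)\right] + O_P\left(\frac{1}{n a_n^4}\right) = O_P\left(a_n^2 + \frac{1}{\sqrt{n a_n^2}}\right).$$

Similarly, one gets with 13 and the strong consistency of the estimator of the
gradient of the density that $\Delta_n(x,y) = O_{a.s.}\left(\sqrt{\ln\ln n/(n a_n^2)}\right)$. □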